Text File | 1996-02-28 | 91KB | 1,898 lines
Binary Data File Formats and Descriptions
-----------------------------------------
------------------------------------------------------------------------------
1. Introducing the netCDF and PDB formats
-----------------------------------------
Scientific computing now takes place in a network of high
performance UNIX workstations and mainframes. Many sites include
machines from several manufacturers on a single network. In such a
world it is crucial that programs be portable -- that is, that a
program be written in a language and in a style which enables it to be
compiled and run on as wide a variety of machines as possible. It is
just as crucial that the results of a portable program be written in a
portable form -- that is, in a form legible by any machine on the
network. In order to save space and run fast, results of scientific
calculations are written to disk in a binary format. Unlike text
files, binary files are not guaranteed to make sense when read back by
a machine other than the machine on which they were written. This is
because the binary format used to represent numbers varies from one
computer manufacturer to the next.
Several solutions to this problem have emerged in the past few
years. Two existing solutions -- netCDF files and PDB files -- will
be described in detail here. A third, HDF vertex set format, is not
too different in kind, and will not be discussed. After a thorough
examination of the strengths and weaknesses of these two formats, a
"data language" is described which is capable of describing a very
general binary data file -- including any netCDF or PDB file as a
special case.
The netCDF format is simple and widely used, but its authors
(Unidata, sponsored by NSF) do not describe the actual disk format of
the data in the documentation that comes with the software. This is a
peculiar omission, since no data format can be regarded as truly
portable without being fully documented. Furthermore, anyone can see
that a heavily used data format cannot be changed in an incompatible
way in the future -- any changes must take the form of additional
structural information which does not conflict with the existing file
format. As you will see, the netCDF format is easily flexible enough
to allow such additions without affecting the basic intelligibility of
a netCDF file.
The PDB file format is substantially more complicated and
ambitious. It has been heavily used, but only at the Lawrence
Livermore National Laboratory, where it was designed by Stewart Brown.
PDB is a drastically smaller programming effort than netCDF, and it
has not been cleaned up to remove the evidence of its evolution. Like
netCDF, the underlying format of a PDB file is not described in the
documentation for the package. Despite the historical imperfections
remaining in the PDB format, its full disclosure is just as important
as full disclosure of the netCDF format.
The documentation for both netCDF and PDB concentrates on the
programming interface for reading and writing the files. This is
certainly understandable, but any file format could be accessed by
many interfaces, and any interface could be realized using many file
formats. The file format is the data; the interface merely represents
the data. Only a complete description of the binary file format
actually used by netCDF and PDB files makes any discussions of the
merits and demerits of either format intelligible.
In brief, a netCDF file consists of a small descriptive header,
followed by the binary data described in the header. Both the header
and the data are written to disk using the XDR (eXternal Data
Representation) library invented by Sun. The XDR library converts
integers of various sizes into big-endian order and floating point
numbers into the IEEE standard representation, also big-endian, as
favored by Sun and many other manufacturers. Each variable is of type
byte, char (same as byte), short (2-byte integer), long (4-byte
integer), float (4-byte real), or double (8-byte real), each may be an
array of zero or more dimensions, and each may have any number of
named attributes with associated attribute values. The variables may
be divided into a "non-record" group, and a "record" group. "Record"
variables are physically grouped at the end of the data section, and
the layout of this subsection may be repeated an "unlimited" number of
times to capture time-varying data.
Similarly, a PDB file begins with a very small header describing
the primitive data formats and specifying the address of a longer
descriptive section at the end of the file. The binary data itself
follows the small header. Next comes a section in which any compound
data types are named and defined in terms of the primitive data types
(char, short, int, long, float, double, and pointered data).
Next comes a symbol table, where a variable name, type, and dimension
information are associated with a disk address in the data section.
Finally, a special section allows for corrections to and extensions of
the format not envisioned when the other descriptive components of the
format were designed. (Effectively, the PDB format has been a research
tool for studying various notions about portable binary files.)
The biggest difference between the netCDF and PDB strategies for
data-portability is their handling of the primitive data formats. The
netCDF strategy is to mandate the exact representation of numbers
within the file. The PDB strategy is to describe the primitive data
formats themselves, using a parameterization which is general enough
to cover all currently interesting machines. Both strategies have
strengths and weaknesses.
However, there are two advantages to the PDB strategy which deserve
special mention: First, since PDB files can use the number
representations native to any machine, they can be written and read at
a similar speed on any platform; there is no advantage to owning a
machine with number representations matching those of a Sun. Second,
PDB files can always represent a floating point number exactly (to the
last bit) as long as they are read on a machine with the same number
representations as the one on which they were written. For a netCDF
file, it is in principle possible that floating point data read back
in will differ slightly from what was written out, owing to a
round-trip conversion through XDR format. In practice, this is rarely
a problem, at least for the number representations in common use in
the current generation of machines.
Other differences arise because the netCDF symbol
table comes before the data, while the PDB symbol table comes
afterward. The conflict here is between wanting to be able to add
variables to the file after some of the variable data has already been
written (PDB can, netCDF can't), and robustness in the sense of always
having the data description present in case the code or machine dies
after some, but not all, of the data has been written (netCDF is
robust, PDB isn't).
Finally, there is a considerable difference between the kinds of
objects easily described using a netCDF as opposed to a PDB format.
To a certain degree, this is really a matter of the programming
interface, which can obviously be built in such a way as to add a
level of abstraction not explicitly represented in the underlying file
format. Thus, netCDF has "attributes" associated with each variable,
and makes a simple provision for handling time-varying data.
Conversely, PDB allows for compound data types (like C structs), and a
simple disk representation of pointers.
Despite all of these differences, one is struck by the common
simple idea underlying both the netCDF and PDB file formats: The
section of the file containing the binary data itself and the section
which describes the binary data are completely separate. There is no
formatting information mixed into the binary data. A second common
design feature is that the symbol table information associates a
variable name, type, and dimensions with the disk address where that
variable begins. Because of this commonality, it is easy to use the
PDB machinery to read netCDF files, or the netCDF machinery to read
PDB files written with Sun-like primitive data formats, simply by
providing an alternate routine to open the file and create the data
structure the rest of the package uses to describe the file
internally.
The common features of netCDF and PDB, as well as their respective
strong points, motivate the design of the generic binary data
description language described in the last section of this report.
This language is christened "Clog" for Contents Log. A generic
programming interface capable of reading either netCDF or PDB files
can be based on Clog. Moreover, the data description language can
make it possible to process the data in a completely arbitrary binary
file with a single programming interface.
A portable scientific program which both reads its input from and
writes its output to data-portable self-descriptive binary files
allows for the most efficient use of a heterogeneous network of high
performance computing engines. The netCDF and PDB file formats are
both suitable choices for the required binary I/O files, although each
has definite strengths and weaknesses. A general binary data
description language is capable of describing either netCDF or PDB
files, and can draw on the strengths of each paradigm.
------------------------------------------------------------------------------
2. The netCDF file format
-------------------------
A netCDF file consists of a shallow hierarchy of data types based on
the primitive data types defined by the XDR standard. This standard is
fully described in the documents released by Sun Microsystems:
XDR: eXternal Data Representation Standard, RFC1014
eXternal Data Representation: Sun Technical Notes
XDR(3N) UNIX man page
available on Internet by anonymous FTP to ftp.uu.net:
/packages/bsd-sources/lib/librpc/doc/xdr.rfc.ms.Z
/packages/bsd-sources/lib/librpc/doc/xdr.nts.ms.Z
/packages/bsd-sources/lib/librpc/man/man3/xdr.3n.Z
the man page is available online on many UNIX systems
The netCDF software itself is available on Internet by anonymous FTP to
unidata.ucar.edu, in the file /pub/netcdf/netcdf.tar.Z. This includes
complete documentation of the programming interface provided by Unidata
to write and read netCDF files.
The primitive data types used in a netCDF file are:
opaque - any number of bytes, padded with 0's to a multiple of 4
shorts - any number of 2-byte integers in big-endian order,
padded with 0's to a multiple of 4 bytes
(implemented using the XDR_PUTBYTES and XDR_GETBYTES
macros, 4 bytes at a time -- would have been cleaner
and equivalent to implement using 4-byte opaque)
int, enum - one 4-byte integer in big-endian order
long - one signed 4-byte integer in big-endian order
float - one 4-byte IEEE floating point number in big-endian order
double - one 8-byte IEEE floating point number in big-endian order
u_long - one unsigned 4-byte integer in big-endian order
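The primitive encodings listed above can be sketched with Python's
standard struct module. This is only an illustration of the byte
layout, not the Unidata or Sun XDR code; the function names are
hypothetical.

```python
import struct

def xdr_opaque(data: bytes) -> bytes:
    """Pad a byte string with zeros to a multiple of 4 bytes."""
    return data + b"\x00" * ((-len(data)) % 4)

def xdr_shorts(values) -> bytes:
    """Pack 2-byte big-endian integers, then pad to a multiple of 4 bytes."""
    return xdr_opaque(struct.pack(">%dh" % len(values), *values))

def xdr_long(value: int) -> bytes:
    """One signed 4-byte integer in big-endian order."""
    return struct.pack(">i", value)

def xdr_double(value: float) -> bytes:
    """One 8-byte IEEE floating point number in big-endian order."""
    return struct.pack(">d", value)
```

Three shorts occupy 6 bytes of data, so xdr_shorts adds 2 zero bytes
to reach the next multiple of 4.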
The netCDF file itself is at the opposite end of the type hierarchy,
with the following intermediate layers:
NC_array - a counted list of objects of any other type
The objects in an NC_array may be of variable length,
so total length of an NC_array must be calculated by
summing the lengths of its elements.
NC_var - describes a variable in the file
Each variable has a name, a data type (one of byte,
char, short, long, float, or double), zero or more
dimensions, zero or more attributes, and a disk
address.
NC_dim - Dimensions in a netCDF file are all named and shared
among all variables in the file. Each dimension has
a name and a length.
NC_attr - Each attribute has a name and a value; the value can
be zero or more objects of any of the primitive data
types (byte, char, short, long, float, or double).
Attributes can belong to one variable, or to the
file as a whole.
NC_iarray - a u_long count followed by that many ints
NC_string - a u_long count followed by that many chars, written as
an opaque
2A. Whole-file format
-----
The entire netCDF file has the following data structure:
u_long 0x43444601 "CDF\001", netCDF file magic number
u_long numrecs number of records
NC_array NC_dim dims name<-->dimension length associations
NC_array NC_attr attrs global attributes for this file
NC_array NC_var vars description of variables in this file
<any> data the variables described by vars
Here, the first column gives the type of the data identified in the
second column, and described in the third column. Each object up to
the data section is written immediately after the preceding item; to
make sense of this part of the file, it must be read back sequentially
-- that is, first the dims, then the attrs, then the vars. The vars,
however, contain the disk addresses of all the variables in the data
section, so the data in the file can be randomly accessed after vars
has been read.
Note that additional descriptive information could be added after
vars without affecting the ability of the netCDF software to read the
file.
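As a minimal sketch of the first two items in this layout, the
following reads the magic number and numrecs from the first 8 bytes
of a file image. The function name is hypothetical; the dims, attrs,
and vars arrays that follow are not handled here.

```python
import struct

NC_MAGIC = 0x43444601  # the bytes "CDF\001" read as a big-endian u_long

def read_nc_preamble(buf: bytes) -> int:
    """Check the magic number and return numrecs."""
    magic, numrecs = struct.unpack(">II", buf[:8])
    if magic != NC_MAGIC:
        raise ValueError("not a netCDF file")
    return numrecs
```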
2B. NC_array format
-----
enum type 0 unspecified
1 byte
2 char
3 short
4 long
5 float
6 double
7 bitfield (private)
8 string (private, NC_string)
9 iarray (private, NC_iarray)
10 dimension (private, NC_dim)
11 variable (private, NC_var)
12 attribute (private, NC_attr)
u_long count number of objects in array
<any> objects byte, char written as xdr_opaque
of count bytes
short written as XDR_PUTBYTES
of count shorts (byte pairs)
all others written as a sequence
of count objects
Note that the length of an NC_array is 8 bytes plus the aggregate
length of the array elements.
2C. NC_var format
-----
NC_string name name of the variable
NC_iarray assoc list of 0-origin indices into the
array of dimensions (dims) for this
file
The dimensions in the list are listed
slowest varying first. If the slowest
dimension is the UNLIMITED dimension,
this is a record variable.
NC_array NC_attr attrs attributes for this variable
enum type data type for this variable (values as
for NC_array above)
u_long len total number of bytes on disk
The length of a netCDF record is the
sum of the len fields of all record
variables.
u_long begin disk address
(This is calculated on the basis of the
known data lengths in the Unidata code,
NOT obtained from xdr_getpos.)
All non-record variables precede all
record variables, to allow the
block of record variables to be treated
as an array of an indeterminate
number of record structure instances.
2D. NC_dim format
-----
NC_string name name associated with dimension
long size number of elements along dimension
Note: A netCDF file may have zero or one UNLIMITED dimension, which is
marked by size==0. If a variable has the UNLIMITED dimension, that must
be its slowest varying dimension. Such variables are physically placed
at the end of the data section, and numrecs copies of this "record section"
exist at the end of the data section. (The record variables may occur
anywhere in the vars list of variables for the file.)
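The record layout suggests a simple address calculation, sketched
below. It assumes that the begin field of a record variable is the
address of its data within the first record, which the text above
does not state explicitly; the function and argument names are
hypothetical.

```python
def record_address(begin: int, record_index: int, record_lens) -> int:
    """Byte address of one record variable's data in a given record.

    begin        -- the variable's begin field (assumed: address in record 0)
    record_index -- 0-origin record number, less than numrecs
    record_lens  -- the len fields of all record variables in the file
    """
    record_size = sum(record_lens)   # length of one whole record
    return begin + record_index * record_size
```

For example, with two record variables of 8 and 4 bytes, each record
is 12 bytes long, so record 3 of a variable beginning at byte 100
would start at byte 136.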
2E. NC_attr format
-----
NC_string name name of attribute
NC_array <any> data value of attribute
Notes: The data must be one of the "public" types (byte, char, short,
long, float, or double). An NC_attr written to the attrs array at the
beginning of the netCDF file is a "global" attribute which applies to
the whole file. An NC_attr written to the attrs array in an NC_var
applies only to that variable.
2F. NC_string format
-----
u_long count number of characters in string
opaque values count characters
Note: The count does NOT include any trailing '\0' character; a count
of 0 is interpreted as (NC_string *)0, NOT as a zero-length string.
2G. NC_iarray format
-----
u_long count number of 4-byte ints in array
int values count ints (written sequentially)
Note: The count can be zero.
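The NC_string and NC_iarray layouts can be decoded as sketched below.
This is an illustration using Python's struct module, not the Unidata
code; the function names are hypothetical.

```python
import struct

def read_nc_string(buf: bytes, pos: int):
    """Decode an NC_string at pos; return (text or None, next position)."""
    (count,) = struct.unpack_from(">I", buf, pos)
    pos += 4
    if count == 0:
        return None, pos              # count 0 stands for a null string
    text = buf[pos:pos + count].decode("ascii")
    pos += count + (-count) % 4       # skip the opaque padding as well
    return text, pos

def read_nc_iarray(buf: bytes, pos: int):
    """Decode an NC_iarray at pos; return (list of ints, next position)."""
    (count,) = struct.unpack_from(">I", buf, pos)
    pos += 4
    values = list(struct.unpack_from(">%di" % count, buf, pos))
    return values, pos + 4 * count
```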
------------------------------------------------------------------------------
3. The PDB file format
----------------------
The self-descriptive information in a netCDF file is stored using a
variety of data types. Thus, a name is a u_long count followed by the
actual characters as an opaque, a disk address is a u_long, and a data
type is an enum. In contrast, the self-descriptive information in a
PDB file is mostly character encoded, with certain characters set
aside as delimiters of various sorts. Hence, a disk address or a size
is converted to the characters of the equivalent decimal number in the
PDB file self-description.
To describe such character encoded data, the following discussion
adopts a notation based on the format argument to the standard C
library routines printf and scanf. The meaning of this notation is as
follows: A quoted string represents a consecutive sequence of 8-bit
bytes containing the ASCII representations for the characters in the
string. Thus,
"Hello"
represents the five bytes 0x48, 0x65, 0x6c, 0x6c, 0x6f, in that order.
The two characters "\" and "%" are exceptions to this rule.
A "\" in a format string introduces an escape sequence which
represents a single non-printable ASCII character. The only escape
sequences required for the following discussion are "\n", "\t",
"\001", and "\002". These have the following meanings:
"\n" means a newline character, which can have any single one
of the three values 0x0a (ASCII line feed), 0x0d (ASCII
carriage return), or 0x1f (ASCII unit separator)
This non-unique choice for a delimiter character is an
inconvenient leftover from an early implementation of
the original PDBLib programming interface.
"\t" means a tab character, 0x09.
"\001" means an ASCII SOH character, 0x01.
"\002" means an ASCII STX character, 0x02.
The "%" character in a format string introduces an escape sequence
which represents the characters produced by converting data into a
printable form. The only such format conversions required in the
following discussion are:
"%d" means the decimal equivalent of an int value, that is,
a sequence of digits possibly preceded by a minus "-".
"%ld" is the same thing for a long value
"%s" means zero or more characters in a null-terminated string
of ASCII characters
Because the bytes 0x01, 0x02, 0x0a, 0x0d, and 0x1f ("\001", "\002",
and "\n") are used as delimiters, these five characters may not occur
in any string output with a "%s" in the PDB file format. A sixth
character, 0x00, may not occur by the definition of "%s". As with the
netCDF file format, particular programming interfaces to read and
write PDB files may impose stricter limitations on the set of
characters which are legal in variable names and data type names.
A discussion of restrictions of this sort will follow the PDB file
format description.
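A reader for this notation needs little more than a routine that
scans up to the next 0x01 delimiter and a test for the three newline
values. The following is a minimal sketch with hypothetical names; it
is not the PDBLib implementation.

```python
PDB_NEWLINES = b"\x0a\x0d\x1f"   # any one of these bytes counts as a newline

def read_field(buf: bytes, pos: int):
    """Read one string field ending in the 0x01 delimiter.

    Returns (text, position just after the delimiter).
    """
    end = buf.index(b"\x01", pos)
    return buf[pos:end].decode("ascii"), end + 1

def read_long_field(buf: bytes, pos: int):
    """Read one decimal field ending in the 0x01 delimiter."""
    text, pos = read_field(buf, pos)
    return int(text), pos

def is_newline(byte: int) -> bool:
    """True if byte is one of the three newline values."""
    return byte in PDB_NEWLINES
```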
3A. Whole-file format
-----
"!<<PDB:II>>!\n" HeadTok 13 bytes of identification
byte count byte count of prim_info + 1
normally count=
24+sizeof(float)+sizeof(double)
byte[count-1] prim_info parameterizations of short, int, long,
float, double, and * primitive types,
giving size, byte order, and floating
point layout
"%ld\001" float_bias exponent bias for float type
"%ld\001\n" double_bias exponent bias for double type
"%ld\001" chart_addr file byte address of structure chart
"%ld\001\n" symtab_addr file byte address of symbol table
<any> data the binary data
PDB_chart chart the structure chart defining compound
data types in terms of primitive data
types and simpler compound types,
begins at byte chart_addr of file
PDB_symtab symtab the symbol table associating a variable
name, type, and dimensions with a disk
address,
begins at byte symtab_addr of file
PDB_extras extras corrections to and extensions of the
PDB file format,
begins at byte immediately following
the symtab
The prim_info array is broken down as follows:
byte[6] sizeof(void *), sizeof(short), sizeof(int),
sizeof(long), sizeof(float), sizeof(double)
the number of bytes of six of the
seven predefined primitive data types
sizeof(char) is always 1 byte
byte[3] orderof(short), orderof(int), orderof(long)
the byte order of the three multibyte
integer data types, 1 if the most
significant byte is first, 2 if the
least significant byte is first
byte[sizeof(float)] permutation of bytes for float type
byte[sizeof(double)] permutation of bytes for double type
byte[7] bitsof(float) bit sizes and addresses of sign,
exponent, and mantissa in a float
byte[7] bitsof(double) bit sizes and addresses of sign,
exponent, and mantissa in a double
The meaning of the permutation and bitsof(...) for the float and
double types is fully described below in the section on PDB
parameterization of floating point layout.
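The breakdown of prim_info can be sketched as follows. The function
and dictionary key names are hypothetical, and the permutation and
bitsof values are returned uninterpreted.

```python
def parse_prim_info(info: bytes) -> dict:
    """Split prim_info (the count-1 bytes that follow the count byte)."""
    names = ("ptr", "short", "int", "long", "float", "double")
    sizes = dict(zip(names, info[0:6]))
    orders = dict(zip(("short", "int", "long"), info[6:9]))
    pos = 9
    fields = {}
    for key, length in (("float_perm", sizes["float"]),
                        ("double_perm", sizes["double"]),
                        ("float_bits", 7),
                        ("double_bits", 7)):
        fields[key] = list(info[pos:pos + length])
        pos += length
    if pos != len(info):
        raise ValueError("prim_info length does not match its size fields")
    return {"sizes": sizes, "orders": orders, **fields}
```

The fixed part is 6+3+7+7 = 23 bytes, which together with the count
byte itself accounts for the 24 in the formula for count above.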
3B. PDB_chart format
-----
The structure chart consists of a series of structure
definitions, representing the various compound data types used to
describe the data in the file. Each structure definition begins with:
"%s\001" base_type_name
the name of the compound data type
"%ld\001" size byte size of one instance of the
data structure on disk
The definition continues with one member descriptor per member of
the data structure:
"%s\001" descriptor basically "type name(dimensions)",
described in detail below
An array of 12 arrays of 5 doubles
called "junk" would have the
descriptor "double junk(12,5)".
An individual structure definition ends with a newline character:
"\n" end_def end of structure definition is
thus always "\001\n"
The end of the entire structure chart is marked by:
"\002\n" end_chart This may occur before any structure
definitions, in which case all of
the variables in the file must have
one of the primitive data types.
Data can be either an instance of one of the primitive data types,
an instance of a compound data type, or a pointer to an object. The
data type of a pointer is specified as a <full_type> string, which has
the format
<full_type> is
<ws><base_type_name><indirection_level>
where
<ws> is zero or more of the whitespace characters space or tab,
that is " " or "\t"
<indirection_level> is zero or more asterisk or whitespace
characters, that is "*" or " " or "\t". The level of
indirection is the number of "*" characters; any
<full_type> with a level of indirection greater than
zero represents a pointer.
The general format of a member descriptor is:
<full_type><ws><member_name><dimlist>
where
<ws> is zero or more whitespace (" " or "\t") characters,
but at least one such character if the <indirection_level>
field of the <full_type> has zero characters
<full_type> is the data type of this member; its <base_type_name>
is either a primitive data type name, or the name of a
previously defined compound data type,
<member_name> is the member name associated with this descriptor,
and <dimlist> is either zero characters (if the member is a scalar), or
<open_dimlist><dimlist_interior><close_dimlist>
where
<open_dimlist> is either "(" or "[",
<close_dimlist> is either ")" or "]", and
<dimlist_interior> is a comma "," delimited list of
either "%ld" length number of elements along
this dimension
or "%ld:%ld" origin, max_i origin is a suggested
minimum index value along
this dimension, and max_i
is origin+length-1, the
maximum index value along
this dimension
The <dimlist_interior> may contain whitespace (" " or "\t")
characters anywhere except within the "%ld" fields.
For multidimensional lists, the dimensions are listed
slowest varying first (but see "Major-Order" in PDB_extras below).
The names of the primitive data types are:
"char" same as C language char
"short" same as C language short
"integer" same as C language int
"long" same as C language long
"float" same as C language float
"double" same as C language double
"*" similar to void* in C language, but requires a pointee
type when used as the type of a member or variable
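The member descriptor grammar above can be parsed with a regular
expression, as in the following sketch. The function name is
hypothetical. It assumes that type and member names contain no
whitespace, asterisks, parentheses, or brackets, and it returns each
dimension as an (origin, length) pair.

```python
import re

DESCRIPTOR = re.compile(
    r"[ \t]*([^ \t*()\[\]]+)"   # <ws><base_type_name>
    r"([ \t*]+)"                # <indirection_level> and <ws>
    r"([^ \t*()\[\]]+)"         # <member_name>
    r"(?:[(\[](.*)[)\]])?$"     # optional <dimlist>
)

def parse_descriptor(text: str, default_origin: int = 0):
    """Return (base type, indirection level, member name, dimensions)."""
    m = DESCRIPTOR.match(text)
    if m is None:
        raise ValueError("bad member descriptor: %r" % text)
    base, indirection, name, interior = m.groups()
    dims = []
    for item in (interior.split(",") if interior is not None else []):
        if ":" in item:
            origin, max_i = (int(s) for s in item.split(":"))
            dims.append((origin, max_i - origin + 1))
        else:
            dims.append((default_origin, int(item)))
    return base, indirection.count("*"), name, dims
```

With this sketch, "double junk(12,5)" yields the type "double",
indirection level 0, the name "junk", and two dimensions of lengths
12 and 5.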
3C. PDB_symtab format
-----
The symbol table consists of a series of variable definitions which
specify the variable name, data type, dimensions, and disk address.
Each definition has the following format:
"%s\001" name the name of the variable
"%s\001" full_type the full data type name
This is of the form <full_type>
as described in PDB_chart above.
"%ld\001" number the total number of full_type
objects, which is 1 for a scalar or the
product of the dimension lengths
"%ld\001" address the byte address of the first
byte of this data in the file
The variable definition continues with one (origin, length) pair for
each dimension associated with the variable. As for the dimensions in
a dimension descriptor, the slowest varying dimension is listed first
(but see "Major-Order" in PDB_extras below).
"%ld\001" origin suggested minimum index value for
this dimension
"%ld\001" length number of elements along this
dimension (NOT maximum index value
as in member descriptor)
The variable definition concludes with a newline:
"\n" end_def end of variable definition is
thus always "\001\n"
The end of the entire symbol table is marked by a second consecutive
newline:
"\n" end_symtab end of symbol table is thus always
"\n\n" (unless it is empty, which
is not a very interesting case)
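One variable definition can be decoded by splitting on the 0x01
delimiter, as sketched below. The function name is hypothetical, and
the definition is assumed to have been cut out of the symbol table
already, without its closing newline.

```python
def parse_symtab_entry(entry: bytes):
    """Parse one variable definition (closing newline already removed)."""
    fields = entry.split(b"\x01")
    if len(fields) < 5 or fields[-1] != b"":
        raise ValueError("definition must be 0x01-terminated fields")
    fields = [f.decode("ascii") for f in fields[:-1]]
    name, full_type = fields[0], fields[1]
    number, address = int(fields[2]), int(fields[3])
    rest = [int(f) for f in fields[4:]]
    dims = list(zip(rest[0::2], rest[1::2]))   # (origin, length) pairs
    return name, full_type, number, address, dims
```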
3D. PDB_extras format
-----
The extras section begins with the byte immediately following
end_symtab, the end of the symbol table. The extras section consists
of a sequence of extra blocks. Each extra block consists of a marker
of the form:
"%s:" extra_id name of the "extra"
followed by any amount of textual data (except for the "Alignment"
extra_id, see below), and ending with:
"\n" This is not necessarily the
first "\n" associated with the
extra_id, but if the extra_id was
not recognized, the characters
following a "\n" are scanned for
"%s:" before the next "\n" to try
to match a known extra_id
The end of the entire extras section is marked by a second consecutive
newline:
"\n" end_extras Unless the extras section is empty,
it therefore ends with "\n\n".
The following extra_id names have meanings in version 7 PDB files:
"Alignment", "Major-Order", "Primitive-Types", "Offset", "Version", and
"Casts". These are listed in rough order of importance for interpreting
the data in a PDB file. Here are the formats for these extras blocks:
"Alignment:" begin block which gives the
alignments of the primitive
data types within structures
byte char_align alignment boundary for char
byte ptr_align alignment boundary for pointers (*)
byte short_align alignment boundary for short
byte int_align alignment boundary for int
byte long_align alignment boundary for long
byte float_align alignment boundary for float
byte double_align alignment boundary for double
"\n" end block of alignments
Note: It would have been more consistent to print the alignments
in ASCII as "%d\001". The 7-byte format shown above would
present a problem if any of the alignments happened to be
one of the ASCII newline characters. In practice, this
doesn't ever happen, since the alignments are always
powers of two.
The "Alignment" extra corrects a critical oversight in the PDB
prim_info (see whole file format above) data. Namely, the byte offset
of a structure member cannot be calculated without knowing whether
the target machine/compiler places alignment restrictions on the
various primitive data types. Therefore, until the "Alignment"
extra has been read, no compound data type defined in the structure
chart has a precise meaning. And yes, it is annoying that the
extras section cannot be read before the symbol table, and the
data types used in the symbol table are defined in the structure
chart, and the structure chart cannot be interpreted without the
"Alignment" extra.
Note that the disk address of a variable is NOT necessarily
aligned to be a multiple of the alignment of its data type. Alignment
applies only to the offset of a structure member from the beginning of
a structure instance. The alignment of a structure member which is
itself a compound data type is computed as the largest alignment of
any of its own members. (There exist machines and compilers for which
this calculation is incorrect, but even on such machines, practical
examples of structures which fail the simple alignment calculation
required by the PDB file format are rare.) In practice, alignments
are always powers of two, so the largest alignment of any member of
a structure is also the least common multiple of all the member
alignments.
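The simple alignment calculation described above can be sketched as
follows. The function and argument names are hypothetical; the sizes
and alignments would come from prim_info and the "Alignment" extra.

```python
def member_offsets(members, size_of, align_of):
    """Compute the byte offset of each structure member.

    members  -- list of (type_name, member_name, element_count)
    size_of  -- dict mapping type name to byte size
    align_of -- dict mapping type name to alignment boundary
    Returns ({member_name: offset}, alignment of the whole structure).
    """
    offsets, pos, struct_align = {}, 0, 1
    for type_name, name, count in members:
        align = align_of[type_name]
        pos += (-pos) % align            # round up to the alignment boundary
        offsets[name] = pos
        pos += size_of[type_name] * count
        struct_align = max(struct_align, align)
    return offsets, struct_align
```

For a char followed by a double with 8-byte alignment, the double
lands at offset 8 rather than 1, and the structure as a whole takes
the alignment 8 when it is used as a member of another structure.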
"Major-Order:" begin Major-Order block
"%d\n" dim_order "101" if first dimension varies
slowest (default)
"102" if first dimension varies
fastest
The "Major-Order" extra MUST be interpreted in order to make
sense of the structure chart and symbol table, since it changes the
meaning of the dimension lists in both structure member descriptors
and variable definitions. If the "Major-Order" extra is not present,
the default is that the first dimension listed is the slowest varying
dimension, and the last dimension listed is the fastest varying.
Thus, a structure member with the descriptor "int x(2,3)" and the
default dim_order means that the six associated values are, in
order, x(0,0), x(0,1), x(0,2), x(1,0), x(1,1), x(1,2). If the
dim_order is 102, the same descriptor describes the six values in
the order x(0,0), x(1,0), x(0,1), x(1,1), x(0,2), x(1,2).
Again, it is an annoyance that the "Major-Order" is known only
after the symbol table has been read.
Unless the "Major-Order" extra is properly interpreted, the
topology of multidimensional arrays in a PDB file will be wrong.
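The effect of dim_order can be checked with a short sketch that lists
index tuples in storage order. The function name is hypothetical, and
indices are taken to start at zero.

```python
from itertools import product

def storage_order(lengths, dim_order=101):
    """List index tuples in the order their values appear on disk."""
    ranges = [range(n) for n in lengths]
    if dim_order == 101:                 # first dimension varies slowest
        return list(product(*ranges))
    if dim_order == 102:                 # first dimension varies fastest
        return [idx[::-1] for idx in product(*reversed(ranges))]
    raise ValueError("unknown dim_order: %r" % dim_order)
```

Calling storage_order((2, 3)) and storage_order((2, 3), 102)
reproduces the two orderings listed above for "int x(2,3)".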
"Primitive-Types:\n" begin Primitive-Types block
The Primitive-Types block is an adjunct to the structure chart,
which allows primitive data types other than char, short, int, long,
float, and double to be defined. These primitive types can be used
as the <base_type_name> either in the symbol table, or in a member
descriptor in the structure chart, just like any other data type.
Each primitive type begins:
"%s\001" base_type_name
the name of the primitive data type
"%ld\001" size byte size of one instance of the
primitive type on disk
"%d\001" alignment the alignment; the byte offset of
a structure member of this data type
will always be a multiple of this
"%d\001" order 1 if the most significant byte is
first, 2 if the least significant
byte is first, and -1 if f_flag is
"NO-CONV" or "FLOAT"
"%s\001" p_flag "ORDER" if a byte order permutation
follows, else "DEFORDER"
"%d\001"[size] permutation if p_flag=="ORDER", the permutation
is listed as size ASCII numbers
These are in the same order as the
permutations in the prim_info above.
"%s\001" f_flag "NO-CONV" if the data is opaque,
"FIX" if the data should be
transformed as an integer, and
"FLOAT" if the data should be
transformed as a floating point
"%ld\001"[8] fp_format if f_flag=="FLOAT", the 8 numbers
parameterizing the floating point
layout are listed as ASCII numbers
These are in the same order as the
floating point descriptions in
prim_info above, except that the
exponent bias is added as an eighth
element of fp_format.
"\n" end_primitive ends the definition of this
primitive data type
The end of the entire Primitive-Types block is marked by:
"\002\n" If there are no additional primitive
types, the entire block is
"Primitive-Types:\n\002\n"
Once again, the Primitive-Types information is required to
interpret the meaning of the structure chart and symbol table, but
cannot be read until after the symbol table.
"Offset:%d\n" default_origin
Specifies the default dimension
origin for member descriptor
dimension lists in the structure
chart. The default default_origin
is zero.
"Version:%d|%s\n" version, date
PDB version number (7 for the format
described here) and file creation
date string
"Casts:\n" begin Casts block
The Casts block provides additional information about members of
data structures which are pointers. Specifically, another member
of the data structure may be of type char *, and point to a string
which is of the form <base_type_name><pointer_indicator>, which is
the "true data type" of the pointee (the <base_type_name> in the
first member descriptor is just a dummy, like void *). Whether or
not this information is of any use depends on the programming
interface. In any event, the only possible use is in writing new
instances of the structure, since on read, the "true type" of the
pointee is always known. For each type-cast member of a data
structure, there is one entry of the form:
"%s\001" base_type the <base_type_name> of the
data structure containing the
cast_member and type_member
"%s\001" cast_member the <member_name> of a member
with a pointer data type
"%s\001\n" type_member the <member_name> of a member
of type char *, whose value
(after dereference) contains
a string representing the
"true type" of the pointee
from the cast_member
The end of the Casts block is marked by:
"\002\n" If there are no casts, the entire
Casts block is "Casts:\n\002\n".
The end of the entire extras section is marked by a second consecutive
newline:
"\n" end_extras end of extras section is thus always
"\n\n" (unless it is empty)
3F. Pointee format
-----
The PDB pointer/pointee format is not optimal, but it does model
several of the most important practical uses of pointers in the C
programming language. Any variable with a full_type containing one or
more trailing asterisk "*" characters is a pointer variable, and any
member descriptor with asterisks preceding the <member_name> is a
pointer member. The pointer itself has no representation at all in
the PDB file. A pointer member has a size and alignment within its
data structure, but the bytes stored there are garbage. A pointer
variable takes up no space at all on disk; its address is the address
of the first pointee (the only pointee if the variable is a scalar).
Every pointee consists of a descriptive header of indeterminate
length, possibly followed by the pointee data itself. A header-only
pointee contains the disk address of the pointee which is followed by
the data. This allows multiple pointers to the same data without
multiple disk copies of the data. The format of a PDB pointee is:
"%ld\001" nitems the number of objects in the pointee
(1 if it is a scalar; otherwise the
pointee is interpreted as a one
dimensional array)
"%s\001" full_type <base_type_name><pointer_indicators>
specifying the data type of the
pointee
"%ld\001" address disk address of beginning of this
pointee if data_here!=0, otherwise
the address of the pointee with
data_here!=0 containing the data
"%d\001\n" data_here 1 if the first byte of the data
immediately follows the "\n",
0 if the address points to another
pointee, which is guaranteed to have
data_here==1
<any> data only present if data_here!=0
A NULL pointer is marked by a pointee with nitems==0, address==-1,
and data_here==0.
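As an illustration of the header layout above, here is a minimal sketch of a header parser in C. The names pointee_header, parse_pointee_header, and is_null_pointee are hypothetical (they are not part of PDBLib), and the 255 character limit on full_type is an assumption made only to keep the example short.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one pointee header; the member names
   follow the field names used in the format description. */
typedef struct {
    long nitems;
    char full_type[256];   /* assumed upper limit, not part of the format */
    long address;
    int  data_here;
} pointee_header;

/* Parse "%ld\001%s\001%ld\001%d\001\n" from buf.  Returns the number
   of bytes in the header (through the final "\n"), or -1 if buf does
   not begin with a well-formed header. */
static int parse_pointee_header(const char *buf, pointee_header *h)
{
    int used = 0;

    assert(buf != NULL && h != NULL);
    if (sscanf(buf, "%ld\001%255[^\001]\001%ld\001%d\001%n",
               &h->nitems, h->full_type, &h->address,
               &h->data_here, &used) != 4)
        return -1;
    /* The "\n" is checked by hand, because a "\n" in a scanf format
       would also swallow whitespace belonging to the data. */
    if (buf[used] != '\n')
        return -1;
    return used + 1;
}

/* A NULL pointer is nitems==0, address==-1, data_here==0. */
static int is_null_pointee(const pointee_header *h)
{
    return h->nitems == 0 && h->address == -1 && h->data_here == 0;
}
```

If data_here is nonzero, the pointee data begins at the byte offset returned by parse_pointee_header.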
The full_type of a pointee need not agree with the type expected
from the nominal type of its pointer. In effect, every PDB pointer is
a C void *, since the pointee contains the data type and number of
items. (The original PDBLib programming interface, however, requires
the Casts extra in order to be able to write data of a different type
than expected on the basis of the pointer declaration; a pointer
variable of a type different than one dereference of its full_type
cannot be written at all with this interface. The Casts extra is not
required for PDBLib to correctly read "cast" data in either pointer
variables or pointer members. The features or limitations of a
particular programming interface have no bearing on the format of a
PDB file, in any event.)
The major drawback of the PDB pointer/pointee format is that
nothing at a known address actually points to the first pointee; the
pointee addresses are stored only in the pointees themselves. The
address of a pointee is determined as follows:
The address of the first pointee of a pointer variable is the
address of the variable. (Since the pointers themselves have no disk
representation, they are not written).
The address of the pointee corresponding to the first pointer
member of the first element of an array of structure instances is the
address of the byte immediately following the array of instances.
(The structure instance array actually takes up space on disk, even if
it consists entirely of pointers. The size and alignment of pointer
members are specified in the small header for the whole file and in
the Alignment extra, respectively. The value of the pointer member
itself is meaningless.) When the first pointee has been completely
written, including any pointees corresponding to pointer members of
its own data type, the pointee corresponding to the second pointer
member of the first element of the structure instance array is written
starting at the address following the first pointee and all its
descendants. This continues until the last pointer member of the
first array element, after which comes the first pointer member of the
second array element, and so on.
The recursive algorithm used to write or read a PDB pointee can be
schematically indicated by a recursive function pdb_object, which
performs serial I/O on an array of nitems objects of type full_type,
beginning at a specified disk address, and returning the address of
the byte following what has just been read or written. Any actual
interface would be substantially more complicated than this, since
additional input arguments would be required to specify data to be
written, and additional output arguments would be required to return
the data read. Nevertheless, here is the schema:
long pdb_object(full_type, nitems, address)
{
  if ( is_a_pointer(full_type) ) {
    /* an array of pointers: one pointee (header first) per pointer */
    while (nitems--) {
      address= read_or_write_pointee_header(address);
      /* a header-only pointee refers to data handled elsewhere */
      if ( not_seen_before(address) )
        address= pdb_object( dereference_type(full_type), 1, address );
    }
  } else {
    /* the array of objects itself comes first... */
    address= read_or_write_object_array( full_type, nitems, address );
    /* ...followed by the pointees of each element's pointer members */
    while (nitems--) {
      while (pointer= next_pointer_member(full_type)) {
        address= pdb_object( pointer_type(pointer),
                             pointer_nitems(pointer), address );
      }
    }
  }
  return address;
}
Note that this algorithm does not permit partial read or write
operations on array pointer variables or structure instances
containing pointer members. The addresses of the pointees are only
revealed by performing the entire sequence of read or write operations
on a complete variable.
3G. PDB parameterization of floating point layouts
-----
Converting from one floating point format to another is far easier
than converting from any floating point format into a textual
representation of a number. In general, a floating point conversion
to any format is not much harder than a conversion to the particular
big-endian IEEE format preferred by the netCDF format.
The PDB parameterization of floating point layouts encompasses all
machines where the floating point size is a multiple of an 8-bit byte,
the exponent is binary, and the exponent and mantissa are contiguous
sequences of bits for some permutation of the bytes. Cray 128-bit
floating point formats (which have two interrelated exponents) and the
hex exponent formats used by a few old mainframes are the only
significant floating point formats not covered by the PDB
parameterization; such machines must convert their internal formats to
a form that is covered in order to read or write a PDB file, just as
they must do a conversion to read or write a netCDF file.
The PDB parameterization has three parts: The permutation, the
specification of the bit addresses and bit sizes of the sign,
exponent, and mantissa, and the bias of the exponent. Using the same
notation as in the XDR RFC1014 protocol specification (sections 3.6
and 3.7), the value of a floating point number with sign S, exponent
E, and mantissa (or fractional part) F is
(-1)^S * 2^(E-bias) * 1.F
where ^ represents exponentiation, * represents multiplication, and
1.F means 1 + (F / 2^(number of bits in mantissa)). Zero is always
represented by all bits of S, E, and F zero. There must be some
permutation of the bytes of the floating point number such that the
bytes containing E (and F) are contiguous and ordered from most
significant bits of E (and F) to least significant. In this big-endian
style order, the bits can be numbered from zero to one less than eight
times the number of bytes. The S, E, and F fields can then be
described by specifying the bit on which they start (bit addresses),
and the number of bits over which they extend (bit size). The sign
always has a bit size of 1.
In a PDB file, the permutation is a list of all the numbers from 1
to sizeof(float) or sizeof(double) where the value 1, 2, 3, and so on
represents the location of a byte in the standard big-endian byte
order defined above, and the position of the value in the list
represents the position of that byte in the actual floating point
number. Hence, the permutation for a float on a Sun SPARCstation
or in an XDR file is {1, 2, 3, 4} (standard big-endian order), while
the permutation of a float on a DECstation 3100 is {4, 3, 2, 1}
(standard little-endian order). The VAX is the only machine with
floating point formats having a non-monotonic permutation. A VAX is a
little-endian machine, but its floating point format is big-endian
with respect to 2-byte words, with each 2-byte word little-endian.
The resulting PDB permutation for a VAX float is {2, 1, 4, 3}.
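The following C fragment is a minimal sketch of how a permutation list might be applied to bring the bytes of one number from their order on disk into the standard big-endian order. The function name to_standard_order is hypothetical; the only thing taken from the format is the meaning of the permutation values.

```c
#include <assert.h>
#include <stddef.h>

/* Reorder the bytes of one number from disk order to standard
   big-endian order.  perm[i] is the 1-origin location, in the
   standard order, of the byte stored at position i on disk. */
static void to_standard_order(const unsigned char *disk,
                              unsigned char *std,
                              const int *perm, size_t size)
{
    size_t i;

    for (i = 0; i < size; i++) {
        assert(perm[i] >= 1 && (size_t)perm[i] <= size);
        std[perm[i] - 1] = disk[i];
    }
}
```

For example, with the little-endian permutation {4, 3, 2, 1}, the disk bytes 00 00 80 3f become 3f 80 00 00.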
The PDB convention for specifying floating point bit sizes and
addresses is:
byte  bits_per_word      8 * number of bytes (redundant)
byte  exponent_size      number of bits in exponent
byte  mantissa_size      number of bits in mantissa
byte  sign_address       bit address of sign (in standard
                         byte order as described above)
byte  exponent_address   bit address of exponent
byte  mantissa_address   bit address of mantissa
byte  mantissa_flag      0 if high order bit of mantissa
                         is preceded by implicit 1 (1.F)
                         1 if high order bit of mantissa
                         is explicitly the 1 (always set
                         except in representation of 0.0)
long  exponent_bias      bias of the exponent (must be less
                         than 2^31 to fit into a long on all
                         machines, but this is not a
                         practical limitation)
The mantissa_flag allows for one more difference between floating
point formats: the 1 in 1.F is sometimes explicitly included as the
first bit of F. This is the case for the Cray floating point format,
and for 10 and 12 byte floating point formats on several platforms.
A few explicit examples of this parameterization should remove all
doubts about its meaning:
          bits  #E  #F  &S  &E  &M  1?   bias
float   {  32,   8, 23,  0,  1,  9,  0,   127}  (netCDF/XDR standard)
double  {  64,  11, 52,  0,  1, 12,  0,  1023}  (netCDF/XDR standard)
    (Above two cover the vast majority of modern machines,
     which are distinguished only by the permutation.)
float   {  64,  15, 48,  0,  1, 16,  1, 16384}  (Cray 1, XMP, YMP)
float   {  32,   8, 23,  0,  1,  9,  0,   129}  (VAX)
double  {  64,   8, 55,  0,  1,  9,  0,   129}  (VAX D-format)
double  {  64,  11, 52,  0,  1, 12,  0,  1025}  (VAX G-format)
    (permutations of VAX doubles are {2, 1, 4, 3, 6, 5, 8, 7})
double  {  96,  15, 64,  0,  1, 32,  1, 16382}  (Macintosh long double)
    (Bits 16-31 unused in this format.)
Note that the permutation is not uniquely specified for the Cray
and Macintosh long double formats. In such a case, the closest
permutation to one of the monotone permutations should be selected.
Another way to say this is to require that in the standard big-endian
order, the E (exponent) field should always have a smaller bit address
than the F (mantissa) field, and the location of any unused bytes
relative to E and F should be preserved.
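To make the parameterization concrete, here is a minimal sketch in C which evaluates (-1)^S * 2^(E-bias) * 1.F for a number whose bytes have already been placed in the standard big-endian order. The names fp_layout, get_bits, scale2, and decode_fp are hypothetical. The sketch assumes mantissa_flag is 0 (implicit leading 1), and it makes no attempt to recognize the denormalized, infinite, or not-a-number encodings which some formats define.

```c
#include <assert.h>

/* The eight numbers of a PDB floating point layout, in file order. */
typedef struct {
    int  bits, esize, msize, saddr, eaddr, maddr, mflag;
    long bias;
} fp_layout;

/* Extract nbits bits beginning at bit address start, where bit 0 is
   the most significant bit of std[0]. */
static unsigned long long get_bits(const unsigned char *std,
                                   int start, int nbits)
{
    unsigned long long v = 0;
    int i;

    for (i = start; i < start + nbits; i++)
        v = (v << 1) | ((std[i / 8] >> (7 - i % 8)) & 1u);
    return v;
}

/* Multiply x by 2^n using only exact doublings and halvings. */
static double scale2(double x, long n)
{
    while (n > 0) { x *= 2.0; n--; }
    while (n < 0) { x *= 0.5; n++; }
    return x;
}

/* Value of the number in std[], which is in standard byte order. */
static double decode_fp(const unsigned char *std, const fp_layout *f)
{
    unsigned long long s, e, m;
    double value;

    /* this sketch handles only the implicit-1 (1.F) case */
    assert(f->mflag == 0 && f->esize < 32 && f->msize <= 64);
    s = get_bits(std, f->saddr, 1);
    e = get_bits(std, f->eaddr, f->esize);
    m = get_bits(std, f->maddr, f->msize);
    if (s == 0 && e == 0 && m == 0)
        return 0.0;                        /* all bits zero is 0.0 */
    value = 1.0 + scale2((double)m, -(long)f->msize);    /* 1.F */
    value = scale2(value, (long)e - f->bias);            /* * 2^(E-bias) */
    return s ? -value : value;
}
```

With the layout {32, 8, 23, 0, 1, 9, 0, 127}, the bytes 3f 80 00 00 decode to 1.0; with the VAX layout {32, 8, 23, 0, 1, 9, 0, 129}, the bytes 40 80 00 00 (in standard order) also decode to 1.0.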
3H. Restrictions on characters used in names
-----
The use of "\001", "\002", and "\n" as separators in PDB string
formats precludes their use in variable names, type names, or member
names. Using the null byte 0x00 in any name would make it far more
difficult to write a C program to access PDB files, so this character
is illegal in any names as well. Hence 0x00, 0x01, 0x02, 0x0a, 0x0d,
and 0x1f may never appear in any PDB string converted with %s in the
above description.
Another absolute proscription follows from the method used to
indicate the type of a pointer variable, namely, a data type name
(either from the structure chart or from the Primitive-Types extra
block) may not contain any space, tab, or asterisk characters, " ",
"\t", or "*", that is 0x20, 0x09, or 0x2a.
For the same reason, structure member names may not include spaces,
tabs, or asterisks. Additionally, no structure member name may
contain either character marking the beginning of the dimension list,
"(" or "[", that is, 0x28, or 0x5b.
A more generic limitation is that no "\n"-terminated sequence of
characters in the extras section may begin with "%s:" where the string
could be mistaken for any present or future extra_id name. For
example, a data type name which appears in the Casts extra block had
better not have a name like "Alignment:". This ugly possibility can
be eliminated by more careful design of future extra block formats; if
such a measure is not taken, then everyone sharing PDB files will need
to be sure they are using the same version of the programming
interface. For now, this is not really a practical problem.
One final warning is that the original PDBLib programming interface
imposes naming restrictions beyond those intrinsic to the PDB file
format, as just described. These are as follows: the proscription of
the characters "(" and "[" used to introduce dimension lists is
extended to variable names in addition to structure member names.
Furthermore, the period, "." or 0x2e is illegal both in variable names
and in structure member names. (These restrictions arise because the
PDBLib interface cracks ASCII text strings to perform partial read and
write operations. If non-text-based partial read and write functions
were available, the additional restrictions on characters in
names would disappear.)
The restrictions on characters used in PDB names are summarized in
the following table:
                            variable   data type   structure member
                            names      names       names
0x00, 0x01, 0x02,           NEVER      NEVER       NEVER
0x0a LF, 0x0d CR, 0x1f US
0x09 TAB, 0x20 SPACE,       -          NEVER       NEVER
0x2a "*"
0x28 "(", 0x5b "["          BAD        -           NEVER
0x2e "."                    BAD        -           BAD
Here "BAD" means that the highest level functions in the original PDBLib
programming interface will fail. Needless to say, the best idea is to
avoid any character in the above list under all circumstances.
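The table can be turned into a simple check. The following C function is only a sketch; the names name_kind and pdb_name_ok are hypothetical, and the strict argument is an invention of this example which selects whether the characters marked BAD are rejected along with those marked NEVER. (The null byte 0x00 cannot occur inside a C string, so it needs no explicit test.)

```c
#include <assert.h>
#include <string.h>

enum name_kind { VARIABLE_NAME, TYPE_NAME, MEMBER_NAME };

/* Return 1 if name contains none of the characters marked NEVER for
   this kind of name.  If strict is nonzero, the characters marked
   BAD (rejected by the original PDBLib interface) also fail. */
static int pdb_name_ok(const char *name, enum name_kind kind, int strict)
{
    assert(name != NULL);
    /* 0x01, 0x02, 0x0a, 0x0d, 0x1f: never legal in any name */
    if (strpbrk(name, "\001\002\n\r\037"))
        return 0;
    /* tab, space, asterisk: illegal in type and member names */
    if (kind != VARIABLE_NAME && strpbrk(name, "\t *"))
        return 0;
    /* "(" and "[" introduce a member's dimension list */
    if (kind == MEMBER_NAME && strpbrk(name, "(["))
        return 0;
    if (strict) {
        if (kind == VARIABLE_NAME && strpbrk(name, "([."))
            return 0;
        if (kind == MEMBER_NAME && strchr(name, '.'))
            return 0;
    }
    return 1;
}
```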
------------------------------------------------------------------------------
4. A Generic Binary Data Description Language
---------------------------------------------
Unidata has provided a plain-text representation for netCDF files
(CDL format) which is extremely useful. The following plain-text
format can describe either a netCDF or a PDB file (as well as HDF
and the majority of one-of-a-kind binary file formats which have been
designed to store scientific data).
The PDB format provides two capabilities lacking in the netCDF
format: non-XDR primitive data formats, and definable data types
including compound data structures and additional primitive types.
The netCDF format offers two capabilities lacking in the PDB format:
history records, and variable attributes.
In a PDB file, the want of a formal provision for history records
or for variable attributes can be fulfilled by variable naming
conventions. The trick is to use special naming conventions to
associate related variables ("x-units" might be the variable used to
store the "units" attribute of the variable "x"), or to imbue a
special significance to some of the variables in the file (instances
of a history record sequence might be named "rec0000", "rec0001", and
so on). Such tricks result in data which is accessed nearly as
efficiently as with the netCDF interface to a netCDF file.
Similar tricks allow single instances of data structures, such as
FORTRAN common blocks, to be accessed in a netCDF file. However, an
array of structure instances is very difficult to handle in netCDF
format, and any file with primitive data formats other than XDR is
impossible.
Of course, the underlying simplicity of the netCDF format is really
a virtue, not a weakness. A physics code written in FORTRAN can't
really generate any data the netCDF format can't handle. In fact, in
designing a PDB format suitable for holding restart or post-processing
information for a physics code, one of the primary objectives is to
remain within the bounds of the data describable by the netCDF format.
Two considerations beyond the scope of the format of an individual
file are important. The first is robustness against unexpected
program or machine crashes when part, but not all, of the data in a
file has been written. The second is the difficulty of handling
extremely large files as a single unit -- it is much easier to deal
with a few dozen smaller files than one monster. Both problems
commonly arise with files used for storing history data, which is
generally stored one record at a time with a long pause between the
writing of one record and the next, and which may grow to a very large
size.
The usual PDB format, which places the self-descriptive information
at the end of the file, is not very robust. Unless the structure
chart, symbol table, and extras are written after each history record,
only to be overwritten by the next, a crash causes the data in the
file to be uninterpretable. The file format itself allows the
structure chart, symbol table, and extras section to precede the data,
as in a netCDF file; perhaps such a strategy should be adopted to deal
with this problem.
A very general way to deal with the problem of robustness is to
keep two files open. Whenever new data is written at the end of the
data file, its description is written at the end of the description
file. The files can be merged when they are really finished, or left
as separate components of a single whole. This two-file scheme has the
advantage that new data can be declared after some data has been
written, without sacrificing robustness in the face of program or
machine crashes. The plain-text binary data description language
defined below is designed to be usable as the format for the
description member of such a pair.
The problem of dealing with very large files is easy to handle in
the case of history data -- just produce a family of files each having
a restricted length, instead of a single giant file. Splitting
history data across several files has a major impact on the
programming interface used to access the data, but no effect on the
format of an individual file. The most important thing to notice here
is that an attractive feature of the netCDF programming interface --
that you can retrieve values of a particular variable over a range of
times with a single subroutine call -- becomes much more difficult to
implement if the data extends over several files. Another remark is
that it is wise to copy the non-record data into each file in a
history family, so that each file "makes sense" by itself.
Bearing all of these remarks in mind, here is a generic binary data
description language, hereby christened "Clog", which can describe the
contents of any PDB or netCDF file, as well as many other binary file
formats:
4A. Notation
-----
The extended Backus-Naur Form notation used in the RFC1014 XDR
standard is adopted here:
1. The characters |, (, ), [, ], and * are special.
2. Terminal symbols are strings surrounded by double quotes.
3. Non-terminal symbols are strings of non-special characters.
4. Alternative items are separated by the vertical bar character |.
5. Optional items are enclosed in square brackets [ ... ].
6. Items are grouped by enclosing them in parentheses ( ... ... ).
7. An item followed by * means zero or more occurrences of that item.
8. A non-terminal followed by : and a set of alternatives constitutes
the definition of that non-terminal.
Comments in a binary data description file begin with /* and end
with */, and are treated as whitespace. An identifier is a letter or
underscore "A-Za-z_", followed by zero or more letters, digits,
underscores, pluses, minuses, periods, or commas "A-Za-z_0-9,.+-". An
identifier may also consist of a quoted string, which is interpreted
as the characters within the quotes, recognizing the following escape
sequences:
\" double quote
\\ one backslash
\ooo an arbitrary 8-bit byte, except that \000 and any following
characters are ignored
An identifier may not be more than 1023 characters long, in its printed
form including the open and close quotes, if any.
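The two forms of identifier might be recognized as in the following C sketch. The function names is_plain_identifier and unquote_identifier are hypothetical. Treating any escape sequence other than the three listed above as an error is an assumption of this example, as is the restriction of \ooo to values no larger than \377.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Return 1 if s is a legal unquoted identifier: a letter or
   underscore, then letters, digits, or any of "_,.+-". */
static int is_plain_identifier(const char *s)
{
    assert(s != NULL);
    if (!(isalpha((unsigned char)*s) || *s == '_'))
        return 0;
    for (s++; *s; s++)
        if (!(isalnum((unsigned char)*s) || strchr("_,.+-", *s)))
            return 0;
    return 1;
}

static int is_octal(char c) { return c >= '0' && c <= '7'; }

/* Decode a quoted identifier q (including its quotes) into out.
   Returns the decoded length, or -1 on a syntax error or if out is
   too small.  After \000, the remaining characters are ignored. */
static int unquote_identifier(const char *q, char *out, size_t outsize)
{
    size_t n = 0;
    int ignoring = 0;

    assert(q != NULL && out != NULL && outsize > 0);
    if (*q++ != '"')
        return -1;
    while (*q != '"') {
        int c = (unsigned char)*q++;
        if (c == '\0')
            return -1;                     /* no closing quote */
        if (c == '\\') {
            if (*q == '"' || *q == '\\') {
                c = *q++;
            } else if (is_octal(q[0]) && q[0] <= '3' &&
                       is_octal(q[1]) && is_octal(q[2])) {
                c = (q[0] - '0') * 64 + (q[1] - '0') * 8 + (q[2] - '0');
                q += 3;
                if (c == 0)
                    ignoring = 1;
            } else {
                return -1;                 /* assumed to be an error */
            }
        }
        if (!ignoring) {
            if (n + 1 >= outsize)
                return -1;
            out[n++] = (char)c;
        }
    }
    out[n] = '\0';
    return (int)n;
}
```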
A number is a sequence of one or more decimal digits optionally
preceded by a minus "-". A float_number is anything readable by the
standard C library "%e" format directive. All control characters and
spaces are treated as whitespace, principally to allow for any
differences in the newline character among various operating systems.
4B. Overview of language
-----
The binary data description language Clog is modeled on C variable
declaration and structure definition syntax. C declarations relate a
variable name, data type and dimension information. Clog variable
declarations must additionally specify the disk address for the
variable. Clog structure definitions are similarly extended to allow
the offset of each member to be specified. This extension makes it
easier to automatically generate a Clog description of some binary
file formats.
Following the PDB file format, Clog allows the bit-by-bit format of
the primitive data types to be specified. This greatly increases the
set of binary files describable using Clog. This same mechanism
enables new primitive data types to be declared.
Following the netCDF file format, Clog has a formal mechanism for
describing a sequence of history records. This allows natural
descriptions of an important class of binary files (including netCDF
files).
Finally, Clog contains a formal means for including new types
of descriptive information not envisioned at its inception. (This has
been a valuable feature of the PDB file format.) Separate "trial" and
"standard" extension syntax is provided. The rules for Clog
extensions are simple: No Clog extension can alter the meaning of any
previously defined part of Clog (thus, nothing like the "Major-Order"
extra block in the PDB file format is acceptable). And any Clog
extensions should be conceived as supplying supplemental information,
rather than as wholesale replacements of existing features to get
additional functionality.
4C. Basic primitive data types
-----
There are six basic primitive data types:
char an 8-bit byte
short a signed integer of at least 2 bytes
- used when small size is important
int a signed integer of at least 2 bytes
- most efficient integer type, used for boolean values
long a signed integer of at least 4 bytes
- most commonly used integer type, e.g.- an array index
float a floating point number of at least 4 bytes, range is
at least 10^(+-38), precision at least 6 decimal digits
double a floating point number usually 8 bytes, range is
at least 10^(+-38), precision at least 9 decimal digits,
but usually at least 10^(+-307) and 14 decimal digits
No data structure (compound data type) or additional primitive may
have one of these six names. Any one of these six names may be used
as a data type without any definition. All other identifiers used as
data type names must be previously defined, with the following two
exceptions:
string a string of 8-bit ASCII characters not containing '\0'
- a string is represented as a long containing the
disk address of the string; the string itself is
a long (aligned as a long) with a non-negative
count of the non-0 characters in the string (i.e.-
the result of the ANSI C strlen function), followed
by that many characters.
pointer a pointer to an array of any type
- the pointee contains the data type and dimension
information; the pointer is represented as a long
containing the disk address of the pointee
If string or pointer is used as a data type without defining it, the
default meaning is as shown. Once used, the string or pointer data
type may not be redefined. A NULL pointer or string is represented by
a disk address of -1; there is no associated pointee.
4D. Clog file layout
-----
clog_description:
"Contents Log"
basic_statement*
[ record_initializer basic_statement* record_declaration* ]
[ end_of_data ]
basic_statement:
primitive_definition
| structure_definition
| variable_declaration
| alignment_spec
| other_information
record_initializer:
record_declaration
| record_begin
end_of_data:
"+" "eod" "@" disk_address
The clog_description must begin with the QUOTED string "Contents
Log" -- if it is not quoted, the Clog lexical rules will divide it
into two tokens. Note that comments and white space may precede the
"Contents Log" token.
The order of basic_statements is restricted by a general "definition
before use" rule, as described in detail below. Basically, this means
a data type must be defined before it can be used. If the
record_initializer is present, basic_statements before it describe
non-record variables, while basic_statements after it describe record
variables. Notice that all record_declaration statements, except
possibly the first, follow the basic_statements which declare the
record variables. Thus, the structure of the records in a Clog
description cannot change.
If present, the +eod statement must be the very last thing in a
Clog description of a file, even beyond any comments. The
disk_address specified is the address of the first byte beyond all of
the data in the file. This may be beyond the end of any variable
declared in the Clog for any number of reasons; the most obvious is
that there may be pointees beyond the last declared data. The entire
+eod statement from the initial "+" to the final digit of the
disk_address must not occupy more than 80 characters.
In addition to specifying a safe address at which to begin adding
data to the binary file, the +eod statement allows the entire Clog to
be appended to the end of the binary file itself to make a single,
self-descriptive package. (Note that this procedure does not damage
either a netCDF or a PDB file.) This is done as follows:
At the address specified in the +eod statement, write the entire
plain-text Clog description of the file, including the final +eod
statement. Then close and truncate the file. A program which opens
the file can scan the last 80 bytes; if it finds a +eod statement, it
can check that "Contents Log" is the first token after the specified
address, and, if so, interpret the Clog to determine the layout of the
binary data in the file.
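The scan of the last 80 bytes might look like the following C sketch, which works on a copy of the file contents in memory. The function name clog_find_eod is hypothetical. Tolerating whitespace after the final digit of the disk_address is an assumption of this example; a stricter reader could insist that the digit be the very last byte.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define EOD_MAX 80   /* longest permitted +eod statement */

/* Look for a +eod statement at the very end of buf[0..len-1].
   Returns its disk_address, or -1 if there is none. */
static long clog_find_eod(const char *buf, size_t len)
{
    char tail[EOD_MAX + 1];
    size_t n = len < EOD_MAX ? len : EOD_MAX;
    size_t i;
    long address = -1;
    int used = 0;

    assert(buf != NULL);
    memcpy(tail, buf + (len - n), n);
    tail[n] = '\0';
    /* search backward for the "+" which begins the statement */
    for (i = n; i > 0; i--) {
        if (tail[i - 1] != '+')
            continue;
        /* whitespace in the format matches zero or more blanks */
        if (sscanf(tail + i - 1, "+ eod @ %ld %n", &address, &used) == 1
            && i - 1 + (size_t)used == n && address >= 0)
            return address;
        break;
    }
    return -1;
}
```

A caller would then seek to the returned address and check that the first token there is "Contents Log" before trusting the description.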
4E. Variable declaration
-----
variable_declaration:
type_name variable_name dimension_spec* ["@" disk_address]
("," variable_name dimension_spec* ["@" disk_address])*
type_name: identifier
variable_name: identifier
dimension_spec:
"[" dimension_length [dimension_name] "]"
| "[" minimum_index ":" maximum_index [dimension_name] "]"
dimension_name: identifier
dimension_length: number
minimum_index: number
maximum_index: number
disk_address: number
The disk_address is a byte address. If omitted, the default is the
next available address after all the variables previously declared.
The "next available" address may be rounded up for alignment purposes,
as discussed in more detail below.
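Rounding an address up for alignment is a one-line computation; the sketch below uses the hypothetical name align_up and assumes the alignment is a positive number of bytes.

```c
#include <assert.h>

/* Smallest multiple of alignment which is not less than address. */
static long align_up(long address, long alignment)
{
    assert(address >= 0 && alignment > 0);
    return (address + alignment - 1) / alignment * alignment;
}
```

For example, align_up(13, 8) is 16, while align_up(16, 8) is still 16.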
If more than one dimension_spec is present, the slowest varying
dimension is first in the list, and the fastest varying dimension
last. This is the C convention. As in C, a multidimensional array is
best regarded as an array of arrays -- the first index specifies which
array, so the second index must vary faster.
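As a sketch of what this ordering implies, the following C function computes the byte offset of one element from the start of its array. The name element_offset is hypothetical, and the indices are assumed to be zero-origin.

```c
#include <assert.h>

/* Byte offset of element index[0..ndims-1] within an array whose
   dimension lengths are length[0..ndims-1], slowest varying first. */
static long element_offset(const long *index, const long *length,
                           int ndims, long elem_size)
{
    long n = 0;
    int i;

    for (i = 0; i < ndims; i++) {
        assert(index[i] >= 0 && index[i] < length[i]);
        n = n * length[i] + index[i];   /* later dimensions vary faster */
    }
    return n * elem_size;
}
```

For an array declared with [3][4] and 8-byte elements, element [1][2] lies at byte offset (1*4 + 2)*8 = 48.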
The optional dimension_name is provided for easier compatibility
with the netCDF format. Behavior is undefined if the same
dimension_name is used for dimension_spec's with different lengths.
The idea is to distinguish between dimension lengths which are
accidentally equal, and those which are equal by virtue of their
variables' meanings. If not supplied, the default dimension name is
"_%ld", where "%ld" is the decimal representation of the dimension
length.
The alternative minimum_index:maximum_index syntax is intended to
suggest the preferred range of the index values. The equivalent
dimension_length is maximum_index-minimum_index+1. This information
is far less important than the dimension_length, since the
dimension_length values (in their proper order!) specify the topology
of the array -- that is, how to find nearest neighbors along the
various dimensions.
The type_name and variable_name identifiers must be unique among
all other type_name and variable_name identifiers, respectively.
However, there are two separate name spaces, so a type_name identifier
may match a variable_name identifier without conflict. The
dimension_name identifiers, if any, form a third independent name
space.
4F. Primitive data type definition
-----
primitive_definition:
"+" "define" type_name
"[" size_value "]" "[" alignment_value "]"
[ "[" order_value "]"
["{" sign_address exponent_address exponent_size
mantissa_address mantissa_size mantissa_flag
exponent_bias "}"] ]
| "+" "define" "string" "standard"
| "+" "define" "pointer" "standard"
size_value: number
alignment_value: number
order_value:
number
| "sequential"
| "pdbpointer"
sign_address: number
exponent_address: number
exponent_size: number
mantissa_address: number
mantissa_size: number
mantissa_flag: number
exponent_bias: number
The type_name must neither have been defined nor referenced
previously.
All primitive type definitions must have a size_value (the number
of bytes one instance occupies) and an alignment_value (the largest
number by which the byte offset of a structure member of this type is
guaranteed to be divisible).
The order_value, if a number, determines the byte ordering within
the size_value. If order_value is not present, the primitive type is
to be regarded as opaque. If order_value is present, but { ... } is
not present, the primitive type is an integer value. If { ... } is
present, the primitive type is a floating point value (order_value
must be present also in this case).
If the order_value is the identifier "sequential", then any
instance of this primitive requires sequential I/O; that is, portions
of arrays of this type may not be read or written. If the size_value
is 0, then the size of an instance is indeterminate. Otherwise, an
instance of the object occupies the specified size and has the
specified alignment as a structure member. The code reading or
writing the file is responsible for recognizing the name of a
sequential primitive and taking appropriate action to read or write
it.
The parameterization of a floating point format is the same as
described above for the PDB file format. The order_value is a
simplification of the general byte permutation provided by the PDB
file format. The meaning of the order value is as follows:
order_value ==> 1. The magnitude of the order_value represents
the number of bytes per "word". Within a
word, the byte order is monotone (either from
most significant to least significant or vice
versa). The magnitude of the order_value is
thus a divisor of the size_value. If the
entire object has monotone byte order, then
the magnitude of the order_value is one. In
practice, this is always the case, except for
the VAX floating point formats.
2. The sign of the order value determines the
word order. The byte order within a word is
always opposite to the word order. (Otherwise
the entire word is monotone, the word size is
one, and the word order is the byte order.)
The sign is positive if the most significant
word is first, negative if the least
significant word is first.
3. An order_value of zero is equivalent to omitting
the order_value altogether; it indicates opaque
data with the specified size and alignment.
In brief, the vast majority of numeric formats fall into one of the
following four categories:
order_value= 1 for big-endian (MSB first) machines
order_value= -1 for little-endian (LSB first) machines
order_value= 0 for opaque data
order_value= 2 for VAX floating point formats
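Since the order_value is a simplification of the PDB byte permutation, one can be expanded into the other. The following C sketch does so under the rules just stated; the function name order_to_permutation is hypothetical, and the result uses the 1-origin convention of the PDB permutation lists described earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Expand a nonzero order_value for an object of size bytes into a
   PDB-style permutation: perm[i] is the 1-origin location, in
   big-endian order, of the byte at position i on disk.
   Returns 0 on success, -1 if order_value does not fit size. */
static int order_to_permutation(int order_value, int size, int *perm)
{
    int w = order_value < 0 ? -order_value : order_value;
    int nwords, i;

    assert(perm != NULL);
    if (w == 0 || size <= 0 || size % w != 0)
        return -1;
    nwords = size / w;
    for (i = 0; i < size; i++) {
        int word = i / w, byte = i % w;
        if (order_value > 0)
            /* most significant word first; within a word the
               least significant byte comes first */
            perm[i] = word * w + (w - 1 - byte) + 1;
        else
            /* least significant word first; within a word the
               most significant byte comes first */
            perm[i] = (nwords - 1 - word) * w + byte + 1;
    }
    return 0;
}
```

An order_value of 2 with a size of 8 yields {2, 1, 4, 3, 6, 5, 8, 7}, the permutation quoted earlier for VAX doubles.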
The precise order of definition of primitive types is significant if
the file contains pointers to objects of non-predefined types. In
this case, the data type of the pointee will be encoded as an ordinal
based on the order of +struct and +define definitions. The exception
is a +define with a type_name of one of the predefined types: char,
short, int, long, float, double, string, or pointer. The order of
such a +define is unimportant, provided only that it precede the first
use of that predefined type. As explained in the next section, a
+define of type long must also precede any uses of the predefined
string or pointer types.
To use the default definitions of string and pointer, you must NOT
specify them using a +define (doing so would redefine them to the
meaning specified in the +define), OR you must use the special
"standard" forms of +define. With their default meanings, a string
and a pointer are each represented as a long.
In general, +defines of the six basic primitive data types should
precede any other statements in the Clog description of a file. If
these are not defined, the default is the standard for the machine on
which the binary data file was written. Without the basic primitive
type definitions, therefore, a Clog description of a file is not
portable across different machine architectures. As an example, a
Clog describing a netCDF file (in XDR format) would begin:
+define char [1][4][1]
+define short [2][4][1]
+define int [4][4][1]
+define long [4][4][1]
+define float [4][4][1] {0 1 8 9 23 0 127}
+define double [8][4][1] {0 1 11 12 52 0 1023}
Note that, since a netCDF file does not support data structures, the
alignment_value is not really significant. For the same reason, the
definition of int is unnecessary. Furthermore, a definition of a
synonym for char would be appropriate:
+define byte [1][4][1]
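As a rough check of the floating point parameters in the example above,
the following Python sketch decodes a value from its seven-number
description. It assumes the seven values are, in order, the sign bit
address, exponent address, exponent size, mantissa address, mantissa
size, mantissa normalization flag, and exponent bias, with bit addresses
counted from the most significant bit; that ordering is an assumption
here, chosen because it matches the IEEE layouts used by XDR. The name
decode_float is hypothetical. Only ordinary normalized numbers and zero
are handled; denormals, infinities, and NaNs are not.

```python
import math
import struct

XDR_FLOAT = (0, 1, 8, 9, 23, 0, 127)
XDR_DOUBLE = (0, 1, 11, 12, 52, 0, 1023)

def decode_float(msb_first: bytes, layout) -> float:
    """Decode one normalized floating point value given MSB-first bytes."""
    sgn_addr, exp_addr, exp_size, man_addr, man_size, man_norm, exp_bias = layout
    if man_norm != 0:
        raise NotImplementedError("only an implicit leading mantissa bit is handled")
    nbits = 8 * len(msb_first)
    bits = int.from_bytes(msb_first, "big")

    def field(addr, size):
        # Bit addresses count from the most significant bit of the object.
        return (bits >> (nbits - addr - size)) & ((1 << size) - 1)

    sign = -1.0 if field(sgn_addr, 1) else 1.0
    exponent = field(exp_addr, exp_size)
    mantissa = field(man_addr, man_size)
    if exponent == 0 and mantissa == 0:
        return sign * 0.0
    # Implicit leading 1: value = 1.mantissa * 2**(exponent - bias)
    return sign * math.ldexp(1 + mantissa / (1 << man_size), exponent - exp_bias)
```

For instance, decode_float(struct.pack(">f", 1.5), XDR_FLOAT) reproduces
1.5, which agrees with Python's own big-endian IEEE decoding.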
4G. Compound data structure definition
-----
structure_definition:
"+" "struct" type_name "{"
full_member_definition member_definition* "}"
full_member_definition:
type_name member_name dimension_spec* ["@" byte_offset]
member_definition:
full_member_definition
| "," member_name dimension_spec* ["@" byte_offset]
member_name: identifier
The leading "+" permits immediate recognition of a structure_definition
as opposed to a variable_declaration, without the necessity for making
"struct" a reserved word in the context of a type_name.
If present, the byte_offset specifies the byte offset of the member
from the beginning of an instance of the data structure. If absent,
the byte offset is the next available byte beyond all previously
declared members; the first member has a byte_offset of 0 by default.
The "next available" byte offset is always rounded up to the nearest
multiple of the alignment value for the type_name of the member.
The alignment value of a compound structure is the largest alignment
value for any of its members (see the discussion of alignment within
data structures below). Despite the fact that the byte_offset syntax
allows it, no two members of a data structure may overlap.
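The offset and alignment rules above can be sketched in Python as
follows. The function layout_members and its tuple format are
hypothetical; each member is given as (name, total size in bytes,
alignment of its type, explicit byte_offset or None). The sketch
computes member offsets and the structure alignment only; it does not
attempt to compute the total size of the structure.

```python
def layout_members(members):
    """Return ({member_name: byte_offset}, structure_alignment)."""
    offsets = {}
    spans = []          # (start, end) of every member placed so far
    next_free = 0       # next available byte beyond all previous members
    struct_align = 1
    for name, size, align, explicit in members:
        if explicit is None:
            # Round the next available byte up to a multiple of the alignment.
            offset = -(-next_free // align) * align
        else:
            offset = explicit
        for start, end in spans:
            if offset < end and start < offset + size:
                raise ValueError(f"member {name!r} overlaps another member")
        spans.append((offset, offset + size))
        offsets[name] = offset
        next_free = max(next_free, offset + size)
        struct_align = max(struct_align, align)
    return offsets, struct_align
```

For a char (size 1, alignment 1) followed by a double (size 8,
alignment 8) and a short (size 2, alignment 2), this places the members
at offsets 0, 8, and 16, and gives the structure an alignment of 8.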
The body of each structure definition has its own name space, so
the member_name need only be unique among all the member_names for
the structure currently being defined.
As usual, type_name in a member_definition must be either a
predefined primitive, or must have been previously defined.
Obviously, a structure may not contain members which are instances of
itself.
Beyond this "define before using" requirement, the precise order of
definition of structures is significant if the file contains pointers
to objects which are structures. In this case, the data type of the
pointee will be encoded as an ordinal based on the order of +struct
and +define definitions.
4H. Record definition
-----
record_begin:
"+" "record" "begin"
record_declaration:
"+" "record" "{" [time_value] "," [cycle_value] "}"
["@" disk_address]
time_value: float_number
cycle_value: number
The first occurrence of +record changes the meaning of subsequent
variable_declaration statements from declaring non-record variables to
declaring record variables. This first occurrence may actually declare
the first history record itself, or it may merely be the record_begin
marker.
In any record_declaration, the disk_address defaults to the next
available address, as for a variable declaration.
The time_value and the cycle_value need not actually represent the
time and cycle number associated with the record; they can be any
double and long value which characterizes a record. Note that the
time_value is specified only to the accuracy it is printed; if a
precise time is required, the time should be made a record variable.
The intent of time_value and cycle_value is to provide more
informative "names" for the record than merely its position in the
sequence of records. If either time_value or cycle_value is omitted
in the first history record instance declaration, it must be omitted
in all following declarations; similarly, if present in the first
declaration, it must be present in all subsequent declarations.
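The consistency rule for time_value and cycle_value can be sketched as
a small Python check. The function check_record_names is hypothetical;
it takes the (time_value, cycle_value) pairs of successive record
declarations, with None standing for an omitted value.

```python
def check_record_names(records):
    """True if every record omits or supplies the same values as the first."""
    if not records:
        return True
    has_time = records[0][0] is not None
    has_cycle = records[0][1] is not None
    return all((t is not None) == has_time and (c is not None) == has_cycle
               for t, c in records)
```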
Because of the possibility of families of time history files, it is
very difficult to realize a programming interface which allows
non-record data to be written after the writing of records has begun.
This practical difficulty is the reason for the division of the Clog
description into a non-record section, followed by a record section.
The restriction of history data to a sequence of a single type of
record, rather than allowing several interleaved sequences of records
of various types, is also deliberate. If several types of record need
to be written, several output files or file families should be used.
Note that a member of the history record structure may be a pointer
to a block of data whose size changes from record to record. For this
reason, Clog records are not necessarily laid out end-to-end in the
file, as are netCDF records. If the record addresses are random, the
efficiency of collection of a portion of the record data across
several or all records is reduced. More importantly, the Clog
description of a file will fail in practice if the storage of an
exceedingly large number of exceedingly small records is attempted.
As a rule of thumb, you're in trouble if your record length is less
than the number of bytes in the "+record" statement necessary to
declare the record (this can't be less than 8 bytes). If you are in
this category, you should strongly consider buffering several records
and writing them as an array for the sake of efficiency anyway.
4I. Additional alignment information
-----
alignment_spec:
"+" "align" alignment_type "[" alignment_value "]"
alignment_type:
"variables"
| "structs"
Some files, for example netCDF files, impose additional alignment
restrictions on variables. This can be specified using the "+align
variables" syntax in Clog. As a special case, alignment_value of 0
means that the same alignment should be applied to a variable in a
file as if it were a member of a data structure. An alignment_value of 1
means that there is no padding between variables; every variable starts
on the byte immediately following the previous variable. The
"@address" syntax in a variable declaration overrides the "+align
variables" statement. The special value 0 is the Clog default; netCDF
files would have "+align variables [4]", and PDB files would have
"+align variables [1]".
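The effect of "+align variables" on the default address of a variable
can be sketched in Python as below. The function name and parameters
are hypothetical. The sketch reads a non-zero alignment_value as
replacing the alignment of the variable's type rather than adding to
it, which is an interpretation of the text above and not something it
states outright.

```python
def next_variable_address(next_free, type_alignment,
                          align_variables=0, explicit=None):
    """Return the disk address at which the next variable is placed."""
    if explicit is not None:
        return explicit      # "@address" overrides "+align variables"
    if align_variables == 0:
        alignment = type_alignment   # align as if a structure member
    else:
        alignment = align_variables  # 1 means no padding at all
    # Round the next available address up to a multiple of the alignment.
    return -(-next_free // alignment) * alignment
```

With the next available address at 5, a variable whose type has
alignment 8 lands at 8 under the default, at 5 under an alignment_value
of 1, and at 8 under an alignment_value of 4.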
Some C compilers place an additional alignment restriction on
struct members which are themselves struct instances (beyond the usual
restriction that the alignment of a struct instance is the same as the
alignment of its most restrictively aligned member). Such an
additional alignment restriction may be expressed in Clog via the
"+align structs" syntax. The value 1 is the default, meaning that
there is no additional alignment restriction on struct instances.
At most one "+align" statement of each type is allowed in a Clog
description. If present, these must come before any variables,
compound data structures, or records have been declared.
4J. Predefined string and pointer formats
-----
As indicated above, the type_names "string" and "pointer" have an
optional predefined meaning, which is designed to map relatively
easily into the C language "char *" and "void *" data types. In order
to do this, descriptive information unavoidably leaks from the
description of the binary data into the data itself. The following
design minimizes this leakage; nevertheless, a binary file designer
should avoid gratuitous use of indirect data types.
An instance of either string or pointer is represented on disk as a
long. Hence, no use of the default string or pointer types may
precede a +define of the long primitive. The value of that long is
interpreted as a disk address (as always, in bytes, with 0 meaning the
address of the first byte). A disk address of -1 is taken as a NULL
pointer, meaning that there is no data associated with the pointer.
In the case of a string, a character count of type long will be
found at the disk address specified in the (non-negative) pointer.
The characters of the string, if any, begin with the byte immediately
following the character count. The terminating NULL character is not
included in either the character count or the string itself.
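Reading a default string can be sketched in Python as follows. The
sketch assumes a long has been defined as a 4-byte big-endian integer,
as in the XDR example earlier, and that the whole file is available as
a bytes object; the name read_string is hypothetical.

```python
import struct

LONG = struct.Struct(">i")   # assumption: 4-byte big-endian signed long

def read_string(image: bytes, pointer_address: int):
    """Return the string whose pointer is stored at pointer_address."""
    (address,) = LONG.unpack_from(image, pointer_address)
    if address == -1:
        return None                       # NULL pointer: no data
    (count,) = LONG.unpack_from(image, address)
    start = address + LONG.size           # characters follow the count
    return image[start:start + count]     # no terminating character stored
```

A string instance therefore costs one long in the variable or structure
that holds it, plus one long and the characters themselves wherever the
pointer leads.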
In the case of a pointer, the (non-negative) pointer is the disk address
of a small header describing the pointee, followed by the pointee data
itself. The header is an array of long integers encoded as follows:
long type_number number representing the data type
of the pointee data:
0 char, 1 short, 2 int, 3 long,
4 float, 5 double,
6 string, 7 pointer,
>=8 for the data types defined using
+struct or +define in the order of
definition in the Clog
long n_dims number of dimensions (0 if scalar)
long[n_dims] length[n_dims] (not present if n_dims==0)
the number of elements along each
dimension, in order from slowest
varying to fastest varying dimension
<garbage> pad any pad necessary to align the
data to a disk address which would
be acceptable if it were an ordinary
variable
<type_number> data the pointee itself
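The pointee header can be decoded with a sketch like the following,
again assuming a long is a 4-byte big-endian integer. The function
read_pointee_header and its data_alignment parameter are hypothetical;
the caller is expected to have already checked for the NULL value -1
and to supply whatever alignment an ordinary variable of the pointee
type would require.

```python
import struct

def read_pointee_header(image: bytes, address: int, data_alignment: int = 1):
    """Return (type_number, lengths, data_address) for the pointee at address."""
    type_number, n_dims = struct.unpack_from(">2i", image, address)
    # Dimension lengths, slowest varying first; empty for a scalar.
    lengths = struct.unpack_from(f">{n_dims}i", image, address + 8)
    header_end = address + 4 * (2 + n_dims)
    # Skip the pad so that the data begins at an aligned disk address.
    data_address = -(-header_end // data_alignment) * data_alignment
    return type_number, list(lengths), data_address
```

For a 3 by 4 array of double (type_number 5) whose header begins at
address 4, the header occupies 16 bytes, so with an alignment of 8 the
data would begin at address 24.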
4K. Generic extension syntax
-----
The preceding sections cover the basic requirements for being able
to decipher the contents of a binary data file. Of course, you can't
necessarily do anything with a bunch of numbers just because you can
read them. In general, the meaning of the numbers in a binary file
emerges only from careful documentation. The required level of
documentation is appropriate to a user's manual for the program which
wrote the file, not to the Clog description of the file.
Nevertheless, sometimes it is appropriate to carry a higher level
of meaning around with a binary data file. The Clog therefore provides
a generic syntax for such information:
other_information:
"+" public_extension [extension_id]
"{" extension_data "}" ["@" disk_address]
| "-" private_extension [extension_id]
"{" extension_data "}" ["@" disk_address]
public_extension: identifier
private_extension: identifier
extension_id: identifier
extension_data: <any sequence of tokens with balanced "{" and "}">
The notion of a public_extension is that considerable effort be
expended to ensure that the associated identifier be unique across all
implementations of the Clog. A private_extension, on the other hand,
can be used immediately and freely at a single site.
The private_extension syntax should not be used as a substitute for
the comment syntax /* ... */.
A public_extension cannot have the identifiers "struct", "define",
"history", or "eod". Furthermore, the following public_extensions are
hereby defined, in order to prevent the corresponding inevitable
private extensions:
"+" "pedigree" "{" pedigree_spec ("," pedigree_spec)* "}"
pedigree_spec:
"created_by" "=" identifier
| "creation_date" "=" identifier
| "modified_by" "=" identifier
| "modification_date" "=" identifier
| "revision" "=" number
| "archive_id" "=" identifier
| "format_version" "=" number
| identifier "=" identifier
| identifier "=" number
Blue bloods and bureaucrats demand pedigrees for their data.
"+" "attributes" [variable_name]
"{" attribute_spec (";" attribute_spec)* "}"
attribute_spec:
attribute_name ["=" attribute_value]
attribute_name: identifier
attribute_value:
number ("," number)*
| float_number ("," float_number)*
| identifier
The attribute extension handles netCDF-style attributes. The
number and float_number tokens are extended by the suffix notation
described in the netCDF User's Guide from Unidata, in the section on
CDL format. An identifier as an attribute_value covers the case of a
string valued attribute; it should normally be a quoted string. If
the variable_name is not present, the attributes apply to the whole
binary file. If present, the variable_name must specify a previously
declared variable.
"+" "value" variable_name "{" variable_value "}"
variable_value:
number ("," variable_value)*
| float_number ("," variable_value)*
| "{" variable_value "}"
This extension is provided in order to be able to directly
translate Unidata CDL files into Clog files. Just as the string and
pointer data types cause a leakage of descriptive information into the
data, the +value extension amounts to a leakage of data into the
descriptive information. Each level of { ... } descends one level
into a structure instance.
"+" "PDBpointer" ("variable" | "member") "{" type_name "}"
The type_name must specify an opaque data type previously defined
by +define. Two separate types should be supplied; one for variables,
which has size==0, and one for structure members, which has non-zero
size. Both are sequential data types, since any object containing a
PDB-style pointer must be read sequentially as a complete block. The
following declarations would be reasonable:
+define "char *" [0][4][sequential]
+PDBpointer variable { "char *" }
+define "char  *" [4][4][sequential] /* note extra space */
+PDBpointer member { "char  *" }
These definitions assume that the sizeof(void *) specified in the PDB
prim_info section was 4, and the ptr_align specified in the PDB
Alignment extra block was 4. There is no limit on the number of
different type_names which can be declared to be PDBpointer, but all
of the corresponding +define statements must be identical.
"+" "PDBcast" type_name "{" member_pair ("," member_pair)* "}"
member_pair: cast_member "," type_member
cast_member: identifier
type_member: identifier
The PDBcast extension handles the information in the PDB Casts
extra block for PDB-style derived classes. Given the clumsy nature of
the PDBpointer public_extension, this does not really work very
well...